import os
from IPython.display import display, HTML
ModuleFolder='C:\\Users\\Gamaliel\\Documents\\G\\ADD\\IBM_DS\\Data_Analysis_Py\\'
os.chdir(ModuleFolder)

Module 1

Lesson Summary

  • Each line in a dataset is a row, and commas separate the values.
  • To understand the data, you must analyze the attributes for each column of data.
  • Python libraries are collections of functions and methods that provide various functionalities without writing code from scratch; they fall into three categories:
    • Scientific Computing
    • Data Visualization
    • Machine Learning Algorithms
  • Many data science libraries are interconnected; for instance, Scikit-learn is built on top of NumPy, SciPy, and Matplotlib.
  • The data format and the file path are two key factors for reading data with Pandas.
  • The read_csv method in Pandas can read files in CSV format into a Pandas DataFrame.
  • Pandas has its own data types, like object, float64, int64, and datetime64.
  • Use the dtypes attribute to check each column’s data type; misclassified data types might need manual correction.
  • Knowing the correct data types helps apply appropriate Python functions to specific columns.
  • Using Statistical Summary with describe() provides count, mean, standard deviation, min, max, and quartile ranges for numerical columns.
  • You can also use include='all' as an argument to get summaries for object-type columns.
  • The statistical summary helps identify potential issues like outliers needing further attention.
  • Using the info() method gives a concise summary of the DataFrame, including each column’s data type and non-null count, useful for quick inspection.
  • Some statistical metrics may return "NaN," indicating missing values, and the program can’t calculate statistics for that specific data type.
  • Python can connect to databases through specialized code, often written in Jupyter notebooks.
  • SQL Application Programming Interfaces (APIs) and Python DB APIs (most often used) facilitate the interaction between Python and the DBMS.
  • SQL APIs connect to DBMS with one or more API calls, build SQL statements as a text string, and use API calls to send SQL statements to the DBMS and retrieve results and statuses.
  • DB-API, Python's standard for interacting with relational databases, uses connection objects to establish and manage database connections and cursor objects to run queries and scroll through the results.
  • Connection Object methods include the cursor(), commit(), rollback(), and close() commands.
  • You can import the database module, use the Connect API to open a connection, and then create a cursor object to run queries and fetch results, as in the sketch below.
  • Remember to close the database connection to free up resources.
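As a minimal sketch of this DB-API pattern (assuming Python's built-in sqlite3 module and a hypothetical instructors table):

```python
import sqlite3  # sqlite3 ships with Python and implements the DB-API

# Use the Connect API to open a connection
conn = sqlite3.connect("example.db")
cursor = conn.cursor()  # cursor objects run queries and scroll through results

# Build SQL statements as text strings and send them to the DBMS
cursor.execute("CREATE TABLE IF NOT EXISTS instructors (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO instructors VALUES (?, ?)", (1, "Ada"))
conn.commit()  # make the changes permanent

cursor.execute("SELECT * FROM instructors")
print(cursor.fetchall())  # retrieve the results

# Close the cursor and connection to free up resources
cursor.close()
conn.close()
```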
# from pyodide.http import pyfetch

# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())
import pandas as pd
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_mod1.csv"
df = pd.read_csv(file_path, header=0)
df
Unnamed: 0 Manufacturer Category Screen GPU OS CPU_core Screen_Size_cm CPU_frequency RAM_GB Storage_GB_SSD Weight_kg Price
0 0 Acer 4 IPS Panel 2 1 5 35.560 1.6 8 256 1.60 978
1 1 Dell 3 Full HD 1 1 3 39.624 2.0 4 256 2.20 634
2 2 Dell 3 Full HD 1 1 7 39.624 2.7 8 256 2.20 946
3 3 Dell 4 IPS Panel 2 1 5 33.782 1.6 8 128 1.22 1244
4 4 HP 4 Full HD 2 1 7 39.624 1.8 8 256 1.91 837
... ... ... ... ... ... ... ... ... ... ... ... ... ...
233 233 Lenovo 4 IPS Panel 2 1 7 35.560 2.6 8 256 1.70 1891
234 234 Toshiba 3 Full HD 2 1 5 33.782 2.4 8 256 1.20 1950
235 235 Lenovo 4 IPS Panel 2 1 5 30.480 2.6 8 256 1.36 2236
236 236 Lenovo 3 Full HD 3 1 5 39.624 2.5 6 256 2.40 883
237 237 Toshiba 3 Full HD 2 1 5 35.560 2.3 8 256 1.95 1499

238 rows × 13 columns

Module 3

  • Tools like the 'describe' function in pandas can quickly calculate key statistical measures like mean, standard deviation, and quartiles for all numerical variables in your data frame.
  • Use the 'value_counts' function to summarize data into different categories for categorical data.
  • Box plots offer a more visual representation of the data's distribution for numerical data, indicating features like the median, quartiles, and outliers.
  • Scatter plots are excellent for exploring relationships between continuous variables, like engine size and price, in a car data set.
  • Use Pandas' 'groupby' method to explore relationships between categorical variables.
  • Use pivot tables and heat maps for better data visualizations.
  • Correlation between variables is a statistical measure that indicates how the changes in one variable might be associated with changes in another variable.
  • When exploring correlation, use scatter plots combined with a regression line to visualize relationships between variables.
  • Visualization functions like regplot, from the seaborn library, are especially useful for exploring correlation.
  • The Pearson correlation, a key method for assessing the correlation between continuous numerical variables, provides two critical values—the coefficient, which indicates the strength and direction of the correlation, and the P-value, which assesses the certainty of the correlation.
  • A correlation coefficient close to 1 or -1 indicates a strong positive or negative correlation, respectively, while one close to zero suggests no correlation.
  • For P-values, values less than 0.001 indicate strong certainty in the correlation, while larger values indicate less certainty. Both the coefficient and P-value are important for confirming a strong correlation.
  • Heatmaps provide a comprehensive visual summary of the strength and direction of correlations among multiple variables.
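As a short illustrative sketch of that last point (assuming a dataframe df whose numeric columns include the variables of interest, like the one loaded in this notebook), seaborn can render a correlation matrix as a heatmap directly:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns, drawn as a colour-coded heatmap
corr = df.select_dtypes(include=['number']).corr()
sns.heatmap(corr, cmap='coolwarm')
plt.title('Correlation heatmap')
plt.show()
```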

Module 4

Notebook for EDA

Import Data from Module 2

Setup

Import libraries:

# install specific versions of libraries used in the lab
#! mamba install pandas==1.3.3 -y
#! mamba install numpy==1.21.2 -y
#! mamba install scipy==1.7.1 -y
#! mamba install seaborn==0.9.0 -y
import pandas as pd
import numpy as np

Download the updated dataset by running the cell below.

The functions below will download the dataset into your browser and store it in dataframe df:

file_path= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv"
#await download(file_path, "usedcars.csv")
file_name="usedcars.csv"
df = pd.read_csv(file_path, header=0)

Note: This version of the lab runs on JupyterLite, which requires the dataset to be downloaded to the interface. When working with a downloaded copy of this notebook on a local machine (e.g., Jupyter via Anaconda), you can skip the steps above and pass the URL directly to the pandas.read_csv() function. You can uncomment and run the statements in the cell below.

#file_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'
#df = pd.read_csv(file_path, header=0)

View the first 5 rows of the updated dataframe using dataframe.head()

df.head()
symboling normalized-losses make aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km horsepower-binned diesel gas
0 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111.0 5000.0 21 27 13495.0 11.190476 Medium 0 1
1 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111.0 5000.0 21 27 16500.0 11.190476 Medium 0 1
2 1 122 alfa-romero std two hatchback rwd front 94.5 0.822681 ... 9.0 154.0 5000.0 19 26 16500.0 12.368421 Medium 0 1
3 2 164 audi std four sedan fwd front 99.8 0.848630 ... 10.0 102.0 5500.0 24 30 13950.0 9.791667 Medium 0 1
4 2 164 audi std four sedan 4wd front 99.4 0.848630 ... 8.0 115.0 5500.0 18 22 17450.0 13.055556 Medium 0 1

5 rows × 29 columns

Analyzing Individual Feature Patterns Using Visualization

To install Seaborn, we use pip, the Python package manager.
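For example (a sketch; the %pip magic runs pip from inside a notebook cell, and can be skipped if Seaborn is already installed):

```python
%pip install seaborn
```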

Import visualization packages "Matplotlib" and "Seaborn". Don't forget about "%matplotlib inline" to plot in a Jupyter notebook.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

How to choose the right visualization method?

When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.

# list the data types for each column
print(df.dtypes)
symboling              int64
normalized-losses      int64
make                  object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
city-L/100km         float64
horsepower-binned     object
diesel                 int64
gas                    int64
dtype: object
df.info  # note: without parentheses this returns the bound method shown below; call df.info() to print the summary
<bound method DataFrame.info of      symboling  normalized-losses         make aspiration num-of-doors  \
0            3                122  alfa-romero        std          two   
1            3                122  alfa-romero        std          two   
2            1                122  alfa-romero        std          two   
3            2                164         audi        std         four   
4            2                164         audi        std         four   
..         ...                ...          ...        ...          ...   
196         -1                 95        volvo        std         four   
197         -1                 95        volvo      turbo         four   
198         -1                 95        volvo        std         four   
199         -1                 95        volvo      turbo         four   
200         -1                 95        volvo      turbo         four   

      body-style drive-wheels engine-location  wheel-base    length  ...  \
0    convertible          rwd           front        88.6  0.811148  ...   
1    convertible          rwd           front        88.6  0.811148  ...   
2      hatchback          rwd           front        94.5  0.822681  ...   
3          sedan          fwd           front        99.8  0.848630  ...   
4          sedan          4wd           front        99.4  0.848630  ...   
..           ...          ...             ...         ...       ...  ...   
196        sedan          rwd           front       109.1  0.907256  ...   
197        sedan          rwd           front       109.1  0.907256  ...   
198        sedan          rwd           front       109.1  0.907256  ...   
199        sedan          rwd           front       109.1  0.907256  ...   
200        sedan          rwd           front       109.1  0.907256  ...   

     compression-ratio  horsepower  peak-rpm city-mpg highway-mpg    price  \
0                  9.0       111.0    5000.0       21          27  13495.0   
1                  9.0       111.0    5000.0       21          27  16500.0   
2                  9.0       154.0    5000.0       19          26  16500.0   
3                 10.0       102.0    5500.0       24          30  13950.0   
4                  8.0       115.0    5500.0       18          22  17450.0   
..                 ...         ...       ...      ...         ...      ...   
196                9.5       114.0    5400.0       23          28  16845.0   
197                8.7       160.0    5300.0       19          25  19045.0   
198                8.8       134.0    5500.0       18          23  21485.0   
199               23.0       106.0    4800.0       26          27  22470.0   
200                9.5       114.0    5400.0       19          25  22625.0   

    city-L/100km  horsepower-binned  diesel  gas  
0      11.190476             Medium       0    1  
1      11.190476             Medium       0    1  
2      12.368421             Medium       0    1  
3       9.791667             Medium       0    1  
4      13.055556             Medium       0    1  
..           ...                ...     ...  ...  
196    10.217391             Medium       0    1  
197    12.368421               High       0    1  
198    13.055556             Medium       0    1  
199     9.038462             Medium       1    0  
200    12.368421             Medium       0    1  

[201 rows x 29 columns]>

Question #1:

What is the data type of the column "peak-rpm"?
# Write your code below and press Shift+Enter to execute 
df['peak-rpm'].dtypes
dtype('float64')
Solution:

```python
df['peak-rpm'].dtypes
```

For example, we can calculate the correlation between variables of type "int64" or "float64" using the method "corr":

# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include=['float64', 'int64'])
corres = numeric_df.corr()

# Plot the correlation matrix as a heat map
fig, axs = plt.subplots()
mat = axs.pcolor(corres, cmap='coolwarm')
cols = list(numeric_df.columns)

# Center the ticks on each cell and label them with the column names
axs.set_xticks(np.arange(0.5, len(cols) + 0.5, 1))
axs.set_xticklabels(cols, rotation=90)
axs.set_yticks(np.arange(0.5, len(cols) + 0.5, 1))
axs.set_yticklabels(cols)
axs.set_aspect('equal', 'box')

fig.colorbar(mat)
plt.title('Car properties correlation')
fig.tight_layout()
plt.show()

The diagonal elements are always one; we will study correlation more precisely, using the Pearson correlation, at the end of the notebook.

Question #2:

Find the correlation between the following columns: bore, stroke, compression-ratio, and horsepower.

Hint: if you would like to select those columns, use the following syntax: df[['bore','stroke','compression-ratio','horsepower']]

# Write your code below and press Shift+Enter to execute 
df[['bore','stroke','compression-ratio','horsepower']].corr()
bore stroke compression-ratio horsepower
bore 1.000000 -0.055390 0.001263 0.566936
stroke -0.055390 1.000000 0.187923 0.098462
compression-ratio 0.001263 0.187923 1.000000 -0.214514
horsepower 0.566936 0.098462 -0.214514 1.000000
Solution:

```python
df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()
```

Continuous Numerical Variables:

Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.

In order to start understanding the (linear) relationship between an individual variable and the price, we can use "regplot" which plots the scatterplot plus the fitted regression line for the data. This will be useful later on for visualizing the fit of the simple linear regression model as well.

Let's see several examples of different linear relationships:

Positive Linear Relationship

Let's find the scatterplot of "engine-size" and "price".

# Engine size as potential predictor variable of price
plt.figure(10)
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
plt.show()

As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.

We can examine the correlation between 'engine-size' and 'price' and see that it's approximately 0.87.

df[["engine-size", "price"]].corr()
engine-size price
engine-size 1.000000 0.872335
price 0.872335 1.000000

Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-mpg" and "price".

plt.figure(20)
sns.regplot(x="highway-mpg", y="price", data=df)
plt.show()

As highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.

We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704.

df[['highway-mpg', 'price']].corr()
highway-mpg price
highway-mpg 1.000000 -0.704692
price -0.704692 1.000000

Weak Linear Relationship

Let's see if "peak-rpm" is a predictor variable of "price".

plt.figure(30)
sns.regplot(x="peak-rpm", y="price", data=df)
plt.show()

Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.

We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616.

df[['peak-rpm','price']].corr()
peak-rpm price
peak-rpm 1.000000 -0.101616
price -0.101616 1.000000

Question 3 a):

Find the correlation between x="stroke" and y="price".

Hint: if you would like to select those columns, use the following syntax: df[["stroke","price"]].

# Write your code below and press Shift+Enter to execute
df[['stroke','price']].corr()
stroke price
stroke 1.00000 0.08231
price 0.08231 1.00000
Solution:

```python
# The correlation is 0.0823, shown in the off-diagonal elements of the table.
df[["stroke","price"]].corr()
```

Question 3 b):

Given the correlation results between "price" and "stroke", do you expect a linear relationship?

Verify your results using the function "regplot()".

# Write your code below and press Shift+Enter to execute 
Answer='no'
plt.figure(0)
sns.regplot(x=df['stroke'],y=df['price'])
plt.show()
Solution:

```python
# There is a weak correlation between 'stroke' and 'price' (~0.082), so a linear
# regression will not work well. We can see this using regplot:
sns.regplot(x="stroke", y="price", data=df)
```

Categorical Variables

These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.

Let's look at the relationship between "body-style" and "price".

plt.figure(1)
sns.boxplot(x="body-style", y="price", data=df)
plt.show()

We see that the distributions of price between the different body-style categories have a significant overlap, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":

plt.figure(2)
sns.boxplot(x="engine-location", y="price", data=df)
plt.show()

Here we see that the distributions of price between the two engine-location categories, front and rear, are distinct enough to take engine-location as a potentially good predictor of price.

Let's examine "drive-wheels" and "price".

# drive-wheels
plt.figure(3)
sns.boxplot(x="drive-wheels", y="price", data=df)
plt.show()

Here we see that the distribution of price between the different drive-wheels categories differs. As such, drive-wheels could potentially be a predictor of price.

Descriptive Statistical Analysis

Let's first take a look at the variables by utilizing a description method.

The describe function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.

This will show:

  • the count of that variable
  • the mean
  • the standard deviation (std)
  • the minimum value
  • the quartiles (25%, 50%, and 75%)
  • the maximum value

      We can apply the method "describe" as follows:

      df.describe()
      
      symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km diesel gas
      count 201.000000 201.00000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 197.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000
      mean 0.840796 122.00000 98.797015 0.837102 0.915126 53.766667 2555.666667 126.875622 3.330692 3.256904 10.164279 103.405534 5117.665368 25.179104 30.686567 13207.129353 9.944145 0.099502 0.900498
      std 1.254802 31.99625 6.066366 0.059213 0.029187 2.447822 517.296727 41.546834 0.268072 0.319256 4.004965 37.365700 478.113805 6.423220 6.815150 7947.066342 2.534599 0.300083 0.300083
      min -2.000000 65.00000 86.600000 0.678039 0.837500 47.800000 1488.000000 61.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000 4.795918 0.000000 0.000000
      25% 0.000000 101.00000 94.500000 0.801538 0.890278 52.000000 2169.000000 98.000000 3.150000 3.110000 8.600000 70.000000 4800.000000 19.000000 25.000000 7775.000000 7.833333 0.000000 1.000000
      50% 1.000000 122.00000 97.000000 0.832292 0.909722 54.100000 2414.000000 120.000000 3.310000 3.290000 9.000000 95.000000 5125.369458 24.000000 30.000000 10295.000000 9.791667 0.000000 1.000000
      75% 2.000000 137.00000 102.400000 0.881788 0.925000 55.500000 2926.000000 141.000000 3.580000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16500.000000 12.368421 0.000000 1.000000
      max 3.000000 256.00000 120.900000 1.000000 1.000000 59.800000 4066.000000 326.000000 3.940000 4.170000 23.000000 262.000000 6600.000000 49.000000 54.000000 45400.000000 18.076923 1.000000 1.000000

      The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:

      df.describe(include=['object'])
      
      make aspiration num-of-doors body-style drive-wheels engine-location engine-type num-of-cylinders fuel-system horsepower-binned
      count 201 201 201 201 201 201 201 201 201 200
      unique 22 2 2 5 3 2 6 7 8 3
      top toyota std four sedan fwd front ohc four mpfi Low
      freq 32 165 115 94 118 198 145 157 92 115

      Value Counts

      Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "drive-wheels". Don’t forget the method "value_counts" only works on pandas series, not pandas dataframes. As a result, we only include one bracket df['drive-wheels'], not two brackets df[['drive-wheels']].

      df['drive-wheels'].value_counts()
      
      drive-wheels
      fwd    118
      rwd     75
      4wd      8
      Name: count, dtype: int64

      We can convert the series to a dataframe as follows:

      df['drive-wheels'].value_counts().to_frame()
      
      count
      drive-wheels
      fwd 118
      rwd 75
      4wd 8

      Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and rename the column 'drive-wheels' to 'value_counts'.

      drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
      drive_wheels_counts.reset_index(inplace=True)
      drive_wheels_counts=drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'})
      drive_wheels_counts
      
      value_counts count
      0 fwd 118
      1 rwd 75
      2 4wd 8

      Now let's rename the index to 'drive-wheels':

      drive_wheels_counts.index.name = 'drive-wheels'
      drive_wheels_counts
      
      value_counts count
      drive-wheels
      0 fwd 118
      1 rwd 75
      2 4wd 8

      We can repeat the above process for the variable 'engine-location'.

      # engine-location as variable
      engine_loc_counts = df['engine-location'].value_counts().to_frame()
      engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
      engine_loc_counts.index.name = 'engine-location'
      engine_loc_counts.head(10)
      
      count
      engine-location
      front 198
      rear 3

      After examining the value counts of the engine location, we see that engine location would not be a good predictor variable for the price. This is because we only have three cars with a rear engine and 198 with an engine in the front, so this result is skewed. Thus, we are not able to draw any conclusions about the engine location.

      Basics of Grouping

      The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.

      For example, let's group by the variable "drive-wheels". We see that there are 3 different categories of drive wheels.

      df['drive-wheels'].unique()
      
      array(['rwd', 'fwd', '4wd'], dtype=object)

      If we want to know, on average, which type of drive wheel is most valuable, we can group "drive-wheels" and then average them.

      We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the variable "df_group_one".

      df_group_one = df[['drive-wheels','body-style','price']]
      

      We can then calculate the average price for each of the different categories of data.

      # grouping results
      df_grouped = df_group_one.groupby(['drive-wheels'], as_index=False).agg({'price': 'mean'})
      df_grouped
      
      drive-wheels price
      0 4wd 10241.000000
      1 fwd 9244.779661
      2 rwd 19757.613333

      From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.

      You can also group by multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combination of 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test1'.

      # grouping results
      df_gptest = df[['drive-wheels','body-style','price']]
      grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
      grouped_test1
      
      drive-wheels body-style price
      0 4wd hatchback 7603.000000
      1 4wd sedan 12647.333333
      2 4wd wagon 9095.750000
      3 fwd convertible 11595.000000
      4 fwd hardtop 8249.000000
      5 fwd hatchback 8396.387755
      6 fwd sedan 9811.800000
      7 fwd wagon 9997.333333
      8 rwd convertible 23949.600000
      9 rwd hardtop 24202.714286
      10 rwd hatchback 14337.777778
      11 rwd sedan 21711.833333
      12 rwd wagon 16994.222222

      This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row. We can convert the dataframe to a pivot table using the method "pivot" to create a pivot table from the groups.

      In this case, we will leave the drive-wheels variable as the rows of the table, and pivot body-style to become the columns of the table:

      grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
      grouped_pivot
      
      price
      body-style convertible hardtop hatchback sedan wagon
      drive-wheels
      4wd NaN NaN 7603.000000 12647.333333 9095.750000
      fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
      rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222

      Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.

      grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
      grouped_pivot
      
      price
      body-style convertible hardtop hatchback sedan wagon
      drive-wheels
      4wd 0.0 0.000000 7603.000000 12647.333333 9095.750000
      fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
      rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222

      Question 4:

      Use the "groupby" function to find the average "price" of each car based on "body-style".

      # Write your code below and press Shift+Enter to execute 
      df_t = df[["body-style", "price"]]
      df_t1 = df_t.groupby(['body-style'], as_index=False).agg({'price': 'mean'})
      df_t1
      
      body-style price
      0 convertible 21890.500000
      1 hardtop 22208.500000
      2 hatchback 9957.441176
      3 sedan 14459.755319
      4 wagon 12371.960000
      Solution:

      ```python
      # grouping results
      df_gptest2 = df[['body-style', 'price']]
      grouped_test_bodystyle = df_gptest2.groupby(['body-style'], as_index=False).mean()
      grouped_test_bodystyle
      ```

      If you did not import "pyplot" earlier, import it now:

      import matplotlib.pyplot as plt
      %matplotlib inline 
      

      Variables: Drive Wheels and Body Style vs. Price

      Let's use a heat map to visualize the relationship between body style, drive wheels, and price.

      #use the grouped results
      plt.figure(1)
      plt.pcolor(grouped_pivot, cmap='coolwarm')
      plt.colorbar()
      plt.show()
      

      The heatmap plots the target variable (price) as a colour, with the variables 'drive-wheels' and 'body-style' on the vertical and horizontal axes, respectively. This allows us to visualize how the price is related to 'drive-wheels' and 'body-style'.

      The default labels convey no useful information to us. Let's change that:

      fig, ax = plt.subplots()
      im = ax.pcolor(grouped_pivot, cmap='coolwarm')
      
      # label names: body styles along x (from the columns), drive wheels along y (from the index)
      x_labels = grouped_pivot.columns.levels[1]
      y_labels = grouped_pivot.index
      
      # move ticks and labels to the center of each cell
      ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
      ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
      
      # insert labels
      ax.set_xticklabels(x_labels, minor=False)
      ax.set_yticklabels(y_labels, minor=False)
      
      #rotate label if too long
      plt.xticks(rotation=45)
      
      fig.colorbar(im)
      plt.show()
      

      Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.

      The main question we want to answer in this module is, "What are the main characteristics which have the most impact on the car price?".

      To get a better measure of the important characteristics, we look at the correlation of these variables with the car price. In other words: how is the car price dependent on this variable?

      Correlation and Causation

      Correlation: a measure of the extent of interdependence between variables.

      Causation: the relationship between cause and effect between two variables.

      It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.

      Pearson Correlation

      The Pearson Correlation measures the linear dependence between two variables X and Y.

      The resulting coefficient is a value between -1 and 1 inclusive, where:

      • 1: Perfect positive linear correlation.
      • 0: No linear correlation, the two variables most likely do not affect each other.
      • -1: Perfect negative linear correlation.

      Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.

      df.select_dtypes(include=['number']).corr()
      
      symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km diesel gas
      symboling 1.000000 0.466264 -0.535987 -0.365404 -0.242423 -0.550160 -0.233118 -0.110581 -0.140019 -0.008245 -0.182196 0.075819 0.279740 -0.035527 0.036233 -0.082391 0.066171 -0.196735 0.196735
      normalized-losses 0.466264 1.000000 -0.056661 0.019424 0.086802 -0.373737 0.099404 0.112360 -0.029862 0.055563 -0.114713 0.217299 0.239543 -0.225016 -0.181877 0.133999 0.238567 -0.101546 0.101546
      wheel-base -0.535987 -0.056661 1.000000 0.876024 0.814507 0.590742 0.782097 0.572027 0.493244 0.158502 0.250313 0.371147 -0.360305 -0.470606 -0.543304 0.584642 0.476153 0.307237 -0.307237
      length -0.365404 0.019424 0.876024 1.000000 0.857170 0.492063 0.880665 0.685025 0.608971 0.124139 0.159733 0.579821 -0.285970 -0.665192 -0.698142 0.690628 0.657373 0.211187 -0.211187
      width -0.242423 0.086802 0.814507 0.857170 1.000000 0.306002 0.866201 0.729436 0.544885 0.188829 0.189867 0.615077 -0.245800 -0.633531 -0.680635 0.751265 0.673363 0.244356 -0.244356
      height -0.550160 -0.373737 0.590742 0.492063 0.306002 1.000000 0.307581 0.074694 0.180449 -0.062704 0.259737 -0.087027 -0.309974 -0.049800 -0.104812 0.135486 0.003811 0.281578 -0.281578
      curb-weight -0.233118 0.099404 0.782097 0.880665 0.866201 0.307581 1.000000 0.849072 0.644060 0.167562 0.156433 0.757976 -0.279361 -0.749543 -0.794889 0.834415 0.785353 0.221046 -0.221046
      engine-size -0.110581 0.112360 0.572027 0.685025 0.729436 0.074694 0.849072 1.000000 0.572609 0.209523 0.028889 0.822676 -0.256733 -0.650546 -0.679571 0.872335 0.745059 0.070779 -0.070779
      bore -0.140019 -0.029862 0.493244 0.608971 0.544885 0.180449 0.644060 0.572609 1.000000 -0.055390 0.001263 0.566936 -0.267392 -0.582027 -0.591309 0.543155 0.554610 0.054458 -0.054458
      stroke -0.008245 0.055563 0.158502 0.124139 0.188829 -0.062704 0.167562 0.209523 -0.055390 1.000000 0.187923 0.098462 -0.065713 -0.034696 -0.035201 0.082310 0.037300 0.241303 -0.241303
      compression-ratio -0.182196 -0.114713 0.250313 0.159733 0.189867 0.259737 0.156433 0.028889 0.001263 0.187923 1.000000 -0.214514 -0.435780 0.331425 0.268465 0.071107 -0.299372 0.985231 -0.985231
      horsepower 0.075819 0.217299 0.371147 0.579821 0.615077 -0.087027 0.757976 0.822676 0.566936 0.098462 -0.214514 1.000000 0.107885 -0.822214 -0.804575 0.809575 0.889488 -0.169053 0.169053
      peak-rpm 0.279740 0.239543 -0.360305 -0.285970 -0.245800 -0.309974 -0.279361 -0.256733 -0.267392 -0.065713 -0.435780 0.107885 1.000000 -0.115413 -0.058598 -0.101616 0.115830 -0.475812 0.475812
      city-mpg -0.035527 -0.225016 -0.470606 -0.665192 -0.633531 -0.049800 -0.749543 -0.650546 -0.582027 -0.034696 0.331425 -0.822214 -0.115413 1.000000 0.972044 -0.686571 -0.949713 0.265676 -0.265676
      highway-mpg 0.036233 -0.181877 -0.543304 -0.698142 -0.680635 -0.104812 -0.794889 -0.679571 -0.591309 -0.035201 0.268465 -0.804575 -0.058598 0.972044 1.000000 -0.704692 -0.930028 0.198690 -0.198690
      price -0.082391 0.133999 0.584642 0.690628 0.751265 0.135486 0.834415 0.872335 0.543155 0.082310 0.071107 0.809575 -0.101616 -0.686571 -0.704692 1.000000 0.789898 0.110326 -0.110326
      city-L/100km 0.066171 0.238567 0.476153 0.657373 0.673363 0.003811 0.785353 0.745059 0.554610 0.037300 -0.299372 0.889488 0.115830 -0.949713 -0.930028 0.789898 1.000000 -0.241282 0.241282
      diesel -0.196735 -0.101546 0.307237 0.211187 0.244356 0.281578 0.221046 0.070779 0.054458 0.241303 0.985231 -0.169053 -0.475812 0.265676 0.198690 0.110326 -0.241282 1.000000 -1.000000
      gas 0.196735 0.101546 -0.307237 -0.211187 -0.244356 -0.281578 -0.221046 -0.070779 -0.054458 -0.241303 -0.985231 0.169053 0.475812 -0.265676 -0.198690 -0.110326 0.241282 -1.000000 1.000000

      Sometimes we would like to know the significance of the correlation estimate.

      P-value

      What is this P-value? The P-value is the probability of observing a correlation at least this strong if the two variables were actually unrelated, so a small P-value means the observed correlation is unlikely to be due to chance. Normally, we choose a significance level of 0.05, which means that we require 95% confidence that the correlation between the variables is significant.

      By convention, when the:

      • p-value is $<$ 0.001: we say there is strong evidence that the correlation is significant.
      • p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.
      • p-value is $<$ 0.1: there is weak evidence that the correlation is significant.
      • p-value is $>$ 0.1: there is no evidence that the correlation is significant.

      We can obtain this information using the "stats" module in the "scipy" library.

      from scipy import stats
      

      Wheel-Base vs. Price

      Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.

      pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
      print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  
      
      The Pearson Correlation Coefficient is 0.584641822265508  with a P-value of P = 8.076488270732885e-20
      

      Conclusion:

      Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585).

      Horsepower vs. Price

      Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.

      pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
      print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  
      
      The Pearson Correlation Coefficient is 0.8095745670036559  with a P-value of P =  6.369057428259557e-48
      

      Conclusion:

      Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).

      Length vs. Price

      Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.

      pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
      print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  
      
      The Pearson Correlation Coefficient is 0.6906283804483638  with a P-value of P =  8.016477466159723e-30
      

      Conclusion:

      Since the p-value is $<$ 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).

      Width vs. Price

      Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price':

      pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
      print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value ) 
      
      The Pearson Correlation Coefficient is 0.7512653440522673  with a P-value of P = 9.20033551048206e-38
      

      Conclusion:

      Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).

      Curb-Weight vs. Price

      Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price':

      pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
      print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  
      
      The Pearson Correlation Coefficient is 0.8344145257702843  with a P-value of P =  2.189577238893965e-53
      

      Conclusion:

      Since the p-value is $<$ 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).

      Engine-Size vs. Price

      Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price':

      pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
      print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 
      
      The Pearson Correlation Coefficient is 0.8723351674455185  with a P-value of P = 9.265491622198793e-64
      

      Conclusion:

      Since the p-value is $<$ 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).

      Bore vs. Price

      Let's calculate the Pearson Correlation Coefficient and P-value of 'bore' and 'price':

      pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
      print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =  ", p_value ) 
      
      The Pearson Correlation Coefficient is 0.5431553832626602  with a P-value of P =   8.049189483935315e-17
      

      Conclusion:

      Since the p-value is $<$ 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.543).

      We can repeat the process for 'city-mpg' and 'highway-mpg':

      City-mpg vs. Price

      pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
      print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  
      
      The Pearson Correlation Coefficient is -0.6865710067844678  with a P-value of P =  2.3211320655675098e-29
      

      Conclusion:

      Since the p-value is $<$ 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.

      Highway-mpg vs. Price

      pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
      print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value ) 
      
      The Pearson Correlation Coefficient is -0.704692265058953  with a P-value of P =  1.749547114447557e-31
      
      # Fit a simple linear regression of price on highway-mpg and report R^2
      from sklearn.linear_model import LinearRegression
      lm = LinearRegression()
      X = df[['highway-mpg']]
      Y = df['price']
      lm.fit(X, Y)
      out = lm.score(X, Y)  # R^2 of the fit
      print(lm.coef_)       # slope
      out
      
      [-821.73337832]
      
      0.4965911884339175

      Notebook for model development

      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      

      Load the data and store it in dataframe df:

      # from pyodide.http import pyfetch
      
      # async def download(url, filename):
      #     response = await pyfetch(url)
      #     if response.status == 200:
      #         with open(filename, "wb") as f:
      #             f.write(await response.bytes())
      
      #file_path= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv"
      
      #await download(file_path, "usedcars.csv")
      #file_name="usedcars.csv"
      file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv"
      df = pd.read_csv(file_path, header=0)
      
      #df = pd.read_csv(file_name)
      df.head()
      
      symboling normalized-losses make aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km horsepower-binned diesel gas
      0 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111.0 5000.0 21 27 13495.0 11.190476 Medium 0 1
      1 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111.0 5000.0 21 27 16500.0 11.190476 Medium 0 1
      2 1 122 alfa-romero std two hatchback rwd front 94.5 0.822681 ... 9.0 154.0 5000.0 19 26 16500.0 12.368421 Medium 0 1
      3 2 164 audi std four sedan fwd front 99.8 0.848630 ... 10.0 102.0 5500.0 24 30 13950.0 9.791667 Medium 0 1
      4 2 164 audi std four sedan 4wd front 99.4 0.848630 ... 8.0 115.0 5500.0 18 22 17450.0 13.055556 Medium 0 1

      5 rows × 29 columns

      Note: This version of the lab runs on JupyterLite, which requires the dataset to be downloaded to the interface. When working with a downloaded copy of this notebook on a local machine (e.g., Jupyter via Anaconda), you can skip the steps above and pass the URL directly to the pandas.read_csv() function. You can uncomment and run the statements in the cell below.

      #file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv"
      #df = pd.read_csv(file_path, header=0)
      

      1. Linear Regression and Multiple Linear Regression

      Linear Regression

      One example of a Data Model that we will be using is:

      Simple Linear Regression

      Simple Linear Regression is a method to help us understand the relationship between two variables:

      • The predictor/independent variable (X)
      • The response/dependent variable (that we want to predict)(Y)

      The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.

      $$ Y: \text{Response Variable} \qquad X: \text{Predictor Variable} $$

      Linear Function $$ Yhat = a + b X $$

      • a refers to the intercept of the regression line, in other words: the value of Y when X is 0
      • b refers to the slope of the regression line, in other words: the value with which Y changes when X increases by 1 unit

      Let's load the modules for linear regression:

      from sklearn.linear_model import LinearRegression
      

      Create the linear regression object:

      lm = LinearRegression()
      lm
      
      LinearRegression()

      How could "highway-mpg" help us predict car price?

      For this example, we want to look at how highway-mpg can help us predict car price. Using simple linear regression, we will create a linear function with "highway-mpg" as the predictor variable and the "price" as the response variable.

      print(df.columns)
      X = df[['highway-mpg']]
      Y = df[['price']]
      
      Index(['symboling', 'normalized-losses', 'make', 'aspiration', 'num-of-doors',
             'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length',
             'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
             'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio',
             'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price',
             'city-L/100km', 'horsepower-binned', 'diesel', 'gas'],
            dtype='object')
      

      Fit the linear model using highway-mpg:

      lm.fit(X,Y)
      
      LinearRegression()

      We can output a prediction:

      Yhat=lm.predict(X)
      Yhat[0:5]   
      
      array([[16236.50464347],
             [16236.50464347],
             [17058.23802179],
             [13771.3045085 ],
             [20345.17153508]])

      What is the value of the intercept (a)?

      lm.intercept_
      
      array([38423.30585816])

      What is the value of the slope (b)?

      lm.coef_
      
      array([[-821.73337832]])

      What is the final estimated linear model we get?

      As we saw above, we should get a final linear model with the structure:

      $$ Yhat = a + b X $$

      Plugging in the actual values we get:

      Price = 38423.31 - 821.73 x highway-mpg

      Question #1 a):

      Create a linear regression object called "lm1".
      # Write your code below and press Shift+Enter to execute 
      lm1=LinearRegression()
      lm1
      
      LinearRegression()
      Solution:

      ```python
      lm1 = LinearRegression()
      lm1
      ```

      Question #1 b):

      Train the model using "engine-size" as the independent variable and "price" as the dependent variable.
      # Write your code below and press Shift+Enter to execute 
      cola= ["engine-size"]
      colb=["price"]
      lm1.fit(df[cola],df[colb])
      
      LinearRegression()
      Solution:

      ```python
      lm1.fit(df[['engine-size']], df[['price']])
      lm1
      ```

      Question #1 c):

      Find the slope and intercept of the model.

      Slope

      # Write your code below and press Shift+Enter to execute 
      lm1.coef_
      
      array([[166.86001569]])

      Intercept

      # Write your code below and press Shift+Enter to execute 
      lm1.intercept_
      
      array([-7963.33890628])
      Solution:

      ```python
      # Slope
      lm1.coef_
      # Intercept
      lm1.intercept_
      ```

      Question #1 d):

      What is the equation of the predicted line? You can use x and yhat, or "engine-size" and "price".
      # Write your code below and press Shift+Enter to execute 
      print(["Eq. is Y=mx+b: Predicted price = {:.3f} * 'engine-size' {:.3f}".format(lm1.coef_[0][0],[np.sign(lm1.intercept_)[0]*abs(lm1.intercept_)][0][0])])
      
      ["Eq. is Y=mx+b: Predicted price = 166.860 * 'engine-size' -7963.339"]
      
      Solution:

      ```python
      # using X and Y
      Yhat = -7963.34 + 166.86 * X
      Price = -7963.34 + 166.86 * df['engine-size']
      ```

      Multiple Linear Regression

      What if we want to predict car price using more than one variable?

      If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables. Most real-world regression models involve multiple predictors. We will illustrate the structure using four predictor variables, but these results generalize to any number of predictors:

      $$ Y: \text{Response Variable} \qquad X_1, X_2, X_3, X_4: \text{Predictor Variables} $$
      $$ a: \text{intercept} \qquad b_1, b_2, b_3, b_4: \text{coefficients of Predictor Variables 1 through 4} $$

      The equation is given by:

      $$ Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 $$

      From the previous section we know that other good predictors of price could be:

      • Horsepower
      • Curb-weight
      • Engine-size
      • Highway-mpg

      Let's develop a model using these variables as the predictor variables.

      Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
      

      Fit the linear model using the four above-mentioned variables.

      lm.fit(Z, df['price'])
      
      LinearRegression()

      What is the value of the intercept(a)?

      lm.intercept_
      
      -15806.624626329198

      What are the values of the coefficients (b1, b2, b3, b4)?

      lm.coef_
      
      array([53.49574423,  4.70770099, 81.53026382, 36.05748882])

      What is the final estimated linear model that we get?

      As we saw above, we should get a final linear function with the structure:

      $$ Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 $$

      What is the linear function we get in this example?

      Price = -15806.62 + 53.50 x horsepower + 4.71 x curb-weight + 81.53 x engine-size + 36.06 x highway-mpg

      Question #2 a):

      Create and train a Multiple Linear Regression model "lm2" where the response variable is "price", and the predictor variables are "normalized-losses" and "highway-mpg".
      # Write your code below and press Shift+Enter to execute 
      lm2=LinearRegression()
      dats=df[["normalized-losses", "highway-mpg"]]
      lm2.fit(dats,df['price'])
      
      LinearRegression()
      Solution:

      ```python
      lm2 = LinearRegression()
      lm2.fit(df[['normalized-losses', 'highway-mpg']], df['price'])
      ```

      Question #2 b):

      Find the coefficients of the model.
      # Write your code below and press Shift+Enter to execute 
      lm2.coef_
      
      array([   1.49789586, -820.45434016])
      Solution:

      ```python
      lm2.coef_
      ```

      2. Model Evaluation Using Visualization

      Now that we've developed some models, how do we evaluate our models and choose the best one? One way to do this is by using a visualization.

      Import the visualization package, seaborn:

      # import the visualization package: seaborn
      import seaborn as sns
      %matplotlib inline 
      

      Regression Plot

      When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.

      This plot will show a combination of scattered data points (a scatterplot), as well as the fitted linear regression line going through the data. This will give us a reasonable estimate of the relationship between the two variables, the strength of the correlation, as well as the direction (positive or negative correlation).

      Let's visualize highway-mpg as a potential predictor variable of price:

      width = 12
      height = 10
      plt.figure(figsize=(width, height))
      sns.regplot(x="highway-mpg", y="price", data=df)
      plt.ylim(0,)
      
      (0.0, 48163.11429508339)

      We can see from this plot that price is negatively correlated to highway-mpg since the regression slope is negative. One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data. Let's compare this plot to the regression plot of "peak-rpm".

      plt.figure(figsize=(width, height))
      sns.regplot(x="peak-rpm", y="price", data=df)
      plt.ylim(0,)
      
      (0.0, 47414.1)

      Comparing the regression plot of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" are much closer to the generated line and, on average, decrease. The points for "peak-rpm" have more spread around the predicted line and it is much harder to determine if the points are decreasing or increasing as the "peak-rpm" increases.

      Question #3:

      Given the regression plots above, is "peak-rpm" or "highway-mpg" more strongly correlated with "price"? Use the method ".corr()" to verify your answer.
      # Write your code below and press Shift+Enter to execute 
      #import piplite
      #await piplite.install("jinja2")
      corres = df[["peak-rpm", "highway-mpg", "price"]].corr()
      # mask the diagonal (self-correlations of 1) and highlight the strongest negative value per column
      styled_df = corres[corres < 1].style.highlight_min(axis=0, color='yellow')
      styled_df
      
        peak-rpm highway-mpg price
      peak-rpm nan -0.058598 -0.101616
      highway-mpg -0.058598 nan -0.704692
      price -0.101616 -0.704692 nan
      Solution:

      ```python
      # "highway-mpg" has the stronger correlation with "price" (approximately -0.705),
      # compared to "peak-rpm" (approximately -0.102). Verify with:
      df[["peak-rpm","highway-mpg","price"]].corr()
      ```

      Residual Plot

      A good way to visualize the variance of the data is to use a residual plot.

      What is a residual?

      The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.

      So what is a residual plot?

      A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.

      What do we pay attention to when looking at a residual plot?

      We look at the spread of the residuals:

      - If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.

      width = 6
      height = 5
      plt.figure(figsize=(width, height))
      sns.residplot(x=df['highway-mpg'], y=df['price'])
      plt.show()
      

      What is this plot telling us?

      We can see from this residual plot that the residuals are not randomly spread around the x-axis, leading us to believe that maybe a non-linear model is more appropriate for this data.

      Multiple Linear Regression

How do we visualize a model for Multiple Linear Regression? This gets a bit more complicated because you can't visualize it with a regression or residual plot.

      One way to look at the fit of the model is by looking at the distribution plot. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.

      First, let's make a prediction:

      Y_hat = lm.predict(Z)
      
      plt.figure(figsize=(width, height))
      
      
      ax1 = sns.kdeplot(df['price'],  color="r", label="Actual Value")
      sns.kdeplot(Y_hat, color="b", label="Fitted Values" , ax=ax1)
      
      
      plt.title('Actual vs Fitted Values for Price')
      plt.xlabel('Price (in dollars)')
      plt.ylabel('Proportion of Cars')
      
      plt.show()
      plt.close()
      

      We can see that the fitted values are reasonably close to the actual values since the two distributions overlap a bit. However, there is definitely some room for improvement.

      3. Polynomial Regression and Pipelines

      Polynomial regression is a particular case of the general linear regression model or multiple linear regression models.

      We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.

      There are different orders of polynomial regression:

      Quadratic - 2nd Order
      $$ Yhat = a + b_1 X +b_2 X^2 $$

      Cubic - 3rd Order
$$ Yhat = a + b_1 X + b_2 X^2 + b_3 X^3 $$

      Higher-Order:
$$ Yhat = a + b_1 X + b_2 X^2 + b_3 X^3 + \ldots $$

      We saw earlier that a linear model did not provide the best fit while using "highway-mpg" as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.

      We will use the following function to plot the data:

def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on a smooth grid to draw the curve
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

      

      Let's get the variables:

      x = df['highway-mpg']
      y = df['price']
      

      Let's fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.

      # Here we use a polynomial of the 3rd order (cubic) 
      f = np.polyfit(x, y, 3)
      p = np.poly1d(f)
      print(p)
      
              3         2
      -1.557 x + 204.8 x - 8965 x + 1.379e+05
      

      Let's plot the function:

      PlotPolly(p, x, y, 'highway-mpg')
      
      np.polyfit(x, y, 3)
      
      array([-1.55663829e+00,  2.04754306e+02, -8.96543312e+03,  1.37923594e+05])

      We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function "hits" more of the data points.

      Question #4:

Create an 11th-order polynomial model with the variables x and y from above.
      # Write your code below and press Shift+Enter to execute 
      f=np.polyfit(x,y,11)
      p = np.poly1d(f)
      print(p)
      PlotPolly(p,x,y, 'Highway MPG')
      
                  11             10             9           8         7
      -1.243e-08 x  + 4.722e-06 x  - 0.0008028 x + 0.08056 x - 5.297 x
                6        5             4             3             2
       + 239.5 x - 7588 x + 1.684e+05 x - 2.565e+06 x + 2.551e+07 x - 1.491e+08 x + 3.879e+08
      
Click here for the solution ```python # Here we use a polynomial of the 11th order f1 = np.polyfit(x, y, 11) p1 = np.poly1d(f1) print(p1) PlotPolly(p1,x,y, 'Highway MPG') ```

The analytical expression for a multivariate polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:

      $$ Yhat = a + b_1 X_1 +b_2 X_2 +b_3 X_1 X_2+b_4 X_1^2+b_5 X_2^2 $$

      We can perform a polynomial transform on multiple features. First, we import the module:

      from sklearn.preprocessing import PolynomialFeatures
      

      We create a PolynomialFeatures object of degree 2:

      pr=PolynomialFeatures(degree=2)
      pr
      
      PolynomialFeatures()
      Z_pr=pr.fit_transform(Z)
      

      In the original data, there are 201 samples and 4 features.

      Z.shape
      
      (201, 4)

      After the transformation, there are 201 samples and 15 features.

      Z_pr.shape
      
      (201, 15)
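The count follows from combinatorics: a degree-2 expansion of 4 features produces the bias term, the 4 original features, the 4 squares, and the 6 pairwise products, i.e. 1 + 4 + 4 + 6 = 15 terms. As a small self-contained sketch you can list the generated terms (the method name assumes scikit-learn >= 1.0; older versions use get_feature_names):

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Degree-2 expansion of 4 features: 1 bias + 4 linear + 4 squares + 6 cross terms = 15
demo = PolynomialFeatures(degree=2)
demo.fit(np.zeros((1, 4)))
print(demo.get_feature_names_out())
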

      Pipeline

      Data Pipelines simplify the steps of processing the data. We use the module Pipeline to create a pipeline. We also use StandardScaler as a step in our pipeline.

      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler
      

      We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.

      Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
      

      We input the list as an argument to the pipeline constructor:

      pipe=Pipeline(Input)
      pipe
      
      Pipeline(steps=[('scale', StandardScaler()),
                      ('polynomial', PolynomialFeatures(include_bias=False)),
                      ('model', LinearRegression())])

First, we convert the data in Z to type float to avoid the conversion warnings that may appear because StandardScaler expects float inputs.

      Then, we can normalize the data, perform a transform and fit the model simultaneously.

      Z = Z.astype(float)
      pipe.fit(Z,y)
      
      Pipeline(steps=[('scale', StandardScaler()),
                      ('polynomial', PolynomialFeatures(include_bias=False)),
                      ('model', LinearRegression())])

      Similarly, we can normalize the data, perform a transform and produce a prediction simultaneously.

      ypipe=pipe.predict(Z)
      ypipe[0:4]
      
      array([13102.74784201, 13102.74784201, 18225.54572197, 10390.29636555])

      Question #5:

      Create a pipeline that standardizes the data, then produce a prediction using a linear regression model using the features Z and target y.
      # Write your code below and press Shift+Enter to execute 
      Input=[('scale',StandardScaler()),('model',LinearRegression())]
      Pipe=Pipeline(Input)
      Pipe
      Z = Z.astype(float)
      Pipe.fit(Z,y)
      yhat=Pipe.predict(Z)
      yhat[0:5]
      
      array([13699.11161184, 13699.11161184, 19051.65470233, 10620.36193015,
             15521.31420211])
      Click here for the solution ```python Input=[('scale',StandardScaler()),('model',LinearRegression())] pipe=Pipeline(Input) pipe.fit(Z,y) ypipe=pipe.predict(Z) ypipe[0:10] ```

      4. Measures for In-Sample Evaluation

      When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.

      Two very important measures that are often used in Statistics to determine the accuracy of a model are:

      • R^2 / R-squared
• Mean Squared Error (MSE)

R-squared

      R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.

      The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.
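Formally, with $\bar{y}$ the mean of the observed values, R-squared compares the model's squared errors to the total variation around that mean:

$$ R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2} $$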

      Mean Squared Error (MSE)

      The Mean Squared Error measures the average of the squares of errors. That is, the difference between actual value (y) and the estimated value (ŷ).
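For $n$ observations, this is:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$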

      Model 1: Simple Linear Regression

      Let's calculate the R^2:

      #highway_mpg_fit
      lm.fit(X, Y)
      # Find the R^2
      print('The R-square is: ', lm.score(X, Y))
      
      The R-square is:  0.4965911884339175
      

We can say that ~49.659% of the variation of the price is explained by this simple linear model "highway_mpg_fit".

      Let's calculate the MSE:

      We can predict the output i.e., "yhat" using the predict method, where X is the input variable:

      Yhat=lm.predict(X)
print('The output of the first four predicted values is: ', Yhat[0:4])

The output of the first four predicted values is:  [[16236.50464347]
       [16236.50464347]
       [17058.23802179]
       [13771.3045085 ]]
      

      Let's import the function mean_squared_error from the module metrics:

      from sklearn.metrics import mean_squared_error
      

      We can compare the predicted results with the actual results:

      mse = mean_squared_error(df['price'], Yhat)
      print('The mean square error of price and predicted value is: ', mse)
      
      The mean square error of price and predicted value is:  31635042.944639895
      

      Model 2: Multiple Linear Regression

      Let's calculate the R^2:

      # fit the model 
      lm.fit(Z, df['price'])
      # Find the R^2
      print('The R-square is: ', lm.score(Z, df['price']))
      
      The R-square is:  0.8093562806577457
      

We can say that ~80.936 % of the variation of price is explained by this multiple linear regression "multi_fit".

      Let's calculate the MSE.

      We produce a prediction:

      Y_predict_multifit = lm.predict(Z)
      

      We compare the predicted results with the actual results:

      print('The mean square error of price and predicted value using multifit is: ', \
            mean_squared_error(df['price'], Y_predict_multifit))
      
      The mean square error of price and predicted value using multifit is:  11980366.87072649
      

      Model 3: Polynomial Fit

      Let's calculate the R^2.

      Let’s import the function r2_score from the module metrics as we are using a different function.

      from sklearn.metrics import r2_score
      

      We apply the function to get the value of R^2:

      r_squared = r2_score(y, p(x))
      print('The R-square value is: ', r_squared)
      
      The R-square value is:  0.702376909204032
      

We can say that ~70.238 % of the variation of price is explained by this polynomial fit.

      MSE

      We can also calculate the MSE:

      mean_squared_error(df['price'], p(x))
      
      18703127.64164033

      5. Prediction and Decision Making

      Prediction

In the previous section, we trained the model using the method fit. Now we will use the method predict to produce a prediction. Let's import pyplot for plotting; we will also be using some functions from numpy.

      import matplotlib.pyplot as plt
      import numpy as np
      
      %matplotlib inline 
      

      Create a new input:

      new_input=np.arange(1, 100, 1).reshape(-1, 1)
      

      Fit the model:

      lm.fit(X, Y)
      lm
      
      LinearRegression()

      Produce a prediction:

      yhat=lm.predict(new_input)
      yhat[0:5]
      
      array([[37601.57247984],
             [36779.83910151],
             [35958.10572319],
             [35136.37234487],
             [34314.63896655]])

      We can plot the data:

      plt.plot(new_input, yhat)
      plt.show()
      

      Decision Making: Determining a Good Model Fit

      Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?

      • What is a good R-squared value?

      When comparing models, the model with the higher R-squared value is a better fit for the data.

      • What is a good MSE?

      When comparing models, the model with the smallest MSE value is a better fit for the data.

      Let's take a look at the values for the different models.

      Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.

• R-squared: 0.4965911884339175
      • MSE: 3.16 x10^7

      Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.

• R-squared: 0.8093562806577457
      • MSE: 1.2 x10^7

      Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.

• R-squared: 0.702376909204032
• MSE: 1.87 x 10^7

      Simple Linear Regression Model (SLR) vs Multiple Linear Regression Model (MLR)

      Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful and even act as noise. As a result, you should always check the MSE and R^2.

      In order to compare the results of the MLR vs SLR models, we look at a combination of both the R-squared and MSE to make the best conclusion about the fit of the model.

      • MSE: The MSE of SLR is 3.16x10^7 while MLR has an MSE of 1.2 x10^7. The MSE of MLR is much smaller.
      • R-squared: In this case, we can also see that there is a big difference between the R-squared of the SLR and the R-squared of the MLR. The R-squared for the SLR (~0.497) is very small compared to the R-squared for the MLR (~0.809).

Together, the R-squared and MSE show that the MLR model seems like a better fit than the SLR model in this case.

      Simple Linear Model (SLR) vs. Polynomial Fit

      • MSE: We can see that Polynomial Fit brought down the MSE, since this MSE is smaller than the one from the SLR.
      • R-squared: The R-squared for the Polynomial Fit is larger than the R-squared for the SLR, so the Polynomial Fit also brought up the R-squared quite a bit.

      Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that this was a better fit model than the simple linear regression for predicting "price" with "highway-mpg" as a predictor variable.

      Multiple Linear Regression (MLR) vs. Polynomial Fit

      • MSE: The MSE for the MLR is smaller than the MSE for the Polynomial Fit.
      • R-squared: The R-squared for the MLR is also much larger than for the Polynomial Fit.

      Conclusion

      Comparing these three models, we conclude that the MLR model is the best model to be able to predict price from our dataset. This result makes sense since we have 27 variables in total and we know that more than one of those variables are potential predictors of the final car price.

      Module 4 summary

      • Linear regression refers to using one independent variable to make a prediction.
      • You can use multiple linear regression to explain the relationship between one continuous target y variable and two or more predictor x variables.
      • Simple linear regression, or SLR, is a method used to understand the relationship between two variables, the predictor independent variable x and the target dependent variable y.
      • Use the regplot and residplot functions in the Seaborn library to create regression and residual plots, which help you identify the strength, direction, and linearity of the relationship between your independent and dependent variables.
      • When using residual plots for model evaluation, residuals should ideally have zero mean, appear evenly distributed around the x-axis, and have consistent variance. If these conditions are not met, consider adjusting your model.
      • Use distribution plots for models with multiple features: Learn to construct distribution plots to compare predicted and actual values, particularly when your model includes more than one independent variable. Know that this can offer deeper insights into the accuracy of your model across different ranges of values.
      • The order of the polynomials affects the fit of the model to your data. Apply Python's polyfit function to develop polynomial regression models that suit your specific dataset.
      • To prepare your data for more accurate modeling, use feature transformation techniques, particularly using the preprocessing library in scikit-learn, transform your data using polynomial features, and use the modules like StandardScaler to normalize the data.
      • Pipelines allow you to simplify how you perform transformations and predictions sequentially, and you can use pipelines in scikit-learn to streamline your modeling process.
      • You can construct and train a pipeline to automate tasks such as normalization, polynomial transformation, and making predictions.
• To determine the fit of your model, you can perform in-sample evaluations by using the Mean Square Error (MSE), using Python’s mean_squared_error function from scikit-learn, and using the score method to obtain the R-squared value.
      • A model with a high R-squared value close to 1 and a low MSE is generally a good fit, whereas a model with a low R-squared and a high MSE may not be useful.
      • Be alert to situations where your R-squared value might be negative, which can indicate overfitting.
      • When evaluating models, use visualization and numerical measures and compare different models.
      • The mean square error is perhaps the most intuitive numerical measure for determining whether a model is good.
      • A distribution plot is a suitable method for multiple linear regression.
      • An acceptable r-squared value depends on what you are studying and your use case.
• To evaluate your model’s fit, apply visualization methods like regression and residual plots, check the model's coefficients for sensibility, and use numerical measures:
        • Use Mean Square Error (MSE) to measure the average of the squares of the errors between actual and predicted values and examine R-squared to understand the proportion of the variance in the dependent variable that is predictable from the independent variables.
        • When analyzing residual plots, residuals should be randomly distributed around zero for a good model. In contrast, a residual plot curve or inaccuracies in certain ranges suggest non-linear behavior or the need for more data.

      Module 5

      Introduction to Ridge Regression

For models with multiple independent features, and for models with polynomial feature expansion, it is common to have collinear combinations of features. Left unchecked, this multicollinearity can lead the model to overfit the training data. To control it, the feature set is typically regularized using hyperparameters.

Ridge regression regularizes the feature set using the hyperparameter alpha. The upcoming video shows how ridge regression can be used to reduce standard errors and avoid overfitting while using a regression model.
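As a minimal synthetic sketch of the idea (all data below is made up): with two nearly collinear features, a larger alpha shrinks and stabilizes the fitted coefficients.

from sklearn.linear_model import Ridge
import numpy as np

# Synthetic, nearly collinear features: x2 is almost a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 3*x1 + rng.normal(scale=0.1, size=100)

# Larger alpha shrinks the coefficients more aggressively
for alpha in [0.0001, 0.1, 10]:
    print(alpha, Ridge(alpha=alpha).fit(X, y).coef_)
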

NB for practice Project: Insurance Cost Analysis

      Estimated time needed: 75 minutes

In this project, you have to perform analytics operations on an insurance database that uses the parameters described below.

      Parameter Description Content type
      age Age in years integer
      gender Male or Female integer (1 or 2)
      bmi Body mass index float
      no_of_children Number of children integer
      smoker Whether smoker or not integer (0 or 1)
      region Which US region - NW, NE, SW, SE integer (1,2,3 or 4 respectively)
      charges Annual Insurance charges in USD float

      Objectives

      In this project, you will:

      • Load the data as a pandas dataframe
      • Clean the data, taking care of the blank entries
      • Run exploratory data analysis (EDA) and identify the attributes that most affect the charges
      • Develop single variable and multi variable Linear Regression models for predicting the charges
      • Use Ridge regression to refine the performance of Linear regression models.

      Setup

• Define a function to plot a comparison between actual and predicted distributions
def plotdists(D1,D2,L1,L2,Lg):
    # Overlay the two distributions as KDE curves for a visual comparison
    # (relies on seaborn and matplotlib, imported in the next cell)
    sns.kdeplot(D1,color='b',label=L1)
    sns.kdeplot(D2,color='r',label=L2)
    plt.legend(Lg)
      

      Importing Required Libraries

      We recommend you import all required libraries in one place (here):

      import warnings
      import pandas as pd
      import numpy as np
      import seaborn as sns
      import matplotlib.pyplot as plt
      from sklearn.linear_model import LinearRegression, Ridge
      from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
      from sklearn.preprocessing import PolynomialFeatures, StandardScaler
      from sklearn.pipeline import Pipeline
      warnings.filterwarnings('ignore')
      
      Click here for Solution ```python import pandas as pd import matplotlib.pyplot as plt import numpy as np import seaborn as sns from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler, PolynomialFeatures from sklearn.linear_model import LinearRegression, Ridge from sklearn.metrics import mean_squared_error, r2_score from sklearn.model_selection import cross_val_score, train_test_split ```

      Download the dataset to this lab environment

      Task 1 : Import the dataset

      Import the dataset into a pandas dataframe. Note that there are currently no headers in the CSV file.

      Print the first 10 rows of the dataframe to confirm successful loading.

      import pandas as pd
      Jupyter_Notesath = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/medical_insurance_dataset.csv'
      df = pd.read_csv(Jupyter_Notesath, header=None)
      df.head(10)
      
      0 1 2 3 4 5 6
      0 19 1 27.900 0 1 3 16884.92400
      1 18 2 33.770 1 0 4 1725.55230
      2 28 2 33.000 3 0 4 4449.46200
      3 33 2 22.705 0 0 1 21984.47061
      4 32 2 28.880 0 0 1 3866.85520
      5 31 1 25.740 0 ? 4 3756.62160
      6 46 1 33.440 1 0 4 8240.58960
      7 37 1 27.740 3 0 1 7281.50560
      8 37 2 29.830 2 0 2 6406.41070
      9 60 1 25.840 0 0 1 28923.13692
      Click here for Solution ```python df = pd.read_csv(file_name, header=None) print(df.head(10)) ```

      Add the headers to the dataframe, as mentioned in the project scenario.

      cols=['age','gender','bmi','no_of_children','smoker','region','charges']
      df.columns=cols
      df.head()
      
      age gender bmi no_of_children smoker region charges
      0 19 1 27.900 0 1 3 16884.92400
      1 18 2 33.770 1 0 4 1725.55230
      2 28 2 33.000 3 0 4 4449.46200
      3 33 2 22.705 0 0 1 21984.47061
      4 32 2 28.880 0 0 1 3866.85520
      Click here for Solution ```python headers = ["age", "gender", "bmi", "no_of_children", "smoker", "region", "charges"] df.columns = headers ```

      Now, replace the '?' entries with 'NaN' values.

      df.replace('?',np.nan,inplace=True)
      
      Click here for Solution ```python df.replace('?', np.nan, inplace = True) ```

      Task 2 : Data Wrangling

      Use dataframe.info() to identify the columns that have some 'Null' (or NaN) information.

      df.info()
      
      <class 'pandas.core.frame.DataFrame'>
      RangeIndex: 2772 entries, 0 to 2771
      Data columns (total 7 columns):
       #   Column          Non-Null Count  Dtype  
      ---  ------          --------------  -----  
       0   age             2768 non-null   object 
       1   gender          2772 non-null   int64  
       2   bmi             2772 non-null   float64
       3   no_of_children  2772 non-null   int64  
       4   smoker          2765 non-null   object 
       5   region          2772 non-null   int64  
       6   charges         2772 non-null   float64
      dtypes: float64(2), int64(3), object(2)
      memory usage: 151.7+ KB
      
      Click here for Solution ```python print(df.info()) ```

      Handle missing data:

      • For continuous attributes (e.g., age), replace missing values with the mean.
      • For categorical attributes (e.g., smoker), replace missing values with the most frequent value.
      • Update the data types of the respective columns.
      • Verify the update using df.info().
is_smoker = df['smoker'].value_counts().idxmax()
df["smoker"].replace(np.nan, is_smoker,inplace=True)
df['smoker']=df['smoker'].astype(int)

# Mean of the age values themselves (value_counts().mean() would average the counts instead)
mnage = df['age'].astype('float').mean()
df["age"].replace(np.nan, mnage,inplace=True)
df['age']=df['age'].astype(int)
df.info()
      
      <class 'pandas.core.frame.DataFrame'>
      RangeIndex: 2772 entries, 0 to 2771
      Data columns (total 7 columns):
       #   Column          Non-Null Count  Dtype  
      ---  ------          --------------  -----  
       0   age             2772 non-null   int32  
       1   gender          2772 non-null   int64  
       2   bmi             2772 non-null   float64
       3   no_of_children  2772 non-null   int64  
       4   smoker          2772 non-null   int32  
       5   region          2772 non-null   int64  
       6   charges         2772 non-null   float64
      dtypes: float64(2), int32(2), int64(3)
      memory usage: 130.1 KB
      
      Click here for Solution ```python # smoker is a categorical attribute, replace with most frequent entry is_smoker = df['smoker'].value_counts().idxmax() df["smoker"].replace(np.nan, is_smoker, inplace=True) # age is a continuous variable, replace with mean age mean_age = df['age'].astype('float').mean(axis=0) df["age"].replace(np.nan, mean_age, inplace=True) # Update data types df[["age","smoker"]] = df[["age","smoker"]].astype("int") print(df.info()) ```

Also note that the charges column has values with more than 2 decimal places. Update the charges column so that all values are rounded to 2 decimal places. Verify the conversion by printing the first 5 rows of the updated dataframe.

df['charges']=round(df['charges'],2)
df.head()
      
      age gender bmi no_of_children smoker region charges
      0 19 1 27.900 0 1 3 16884.92
      1 18 2 33.770 1 0 4 1725.55
      2 28 2 33.000 3 0 4 4449.46
      3 33 2 22.705 0 0 1 21984.47
      4 32 2 28.880 0 0 1 3866.86
      Click here for Solution ```python df[["charges"]] = np.round(df[["charges"]],2) print(df.head()) ```

      Task 3 : Exploratory Data Analysis (EDA)

      Implement the regression plot for charges with respect to bmi.

# Regression plots of charges vs. bmi (as asked) and, additionally, vs. age
f,xs=plt.subplots(2,1)
sns.regplot(x='bmi',y='charges',data=df,ax=xs[0]);
sns.regplot(x='age',y='charges',data=df,ax=xs[1]);
plt.subplots_adjust(hspace=0.4)
      
      Click here for Solution ```python sns.regplot(x="bmi", y="charges", data=df, line_kws={"color": "red"}) plt.ylim(0,) ```

      Implement the box plot for charges with respect to smoker.

sns.boxplot(x='smoker',y='charges',data=df)
      
<Axes: xlabel='smoker', ylabel='charges'>
      Click here for Solution ```python sns.boxplot(x="smoker", y="charges", data=df) ```

      Print the correlation matrix for the dataset.

      fig,ax = plt.subplots()
      mt=ax.pcolor(df.corr())
      ax.set_xticks(np.arange(len(list(df.columns)))+.5,minor=False)
      ax.set_xticklabels(df.columns)
      plt.xticks(rotation=45);
      ax.set_yticks(np.arange(len(list(df.columns)))+.5,minor=False)
      ax.set_yticklabels(df.columns);
      ax.set_aspect('equal', 'box')
      fig.colorbar(mt);
      
      Click here for Solution ```python print(df.corr()) ```

      Task 4 : Model Development

      Fit a linear regression model that may be used to predict the charges value, just by using the smoker attribute of the dataset. Print the $ R^2 $ score of this model.

      lr=LinearRegression()
      lr
      
      LinearRegression()
      Click here for Solution ```python X = df[['smoker']] Y = df['charges'] lm = LinearRegression() lm.fit(X,Y) print(lm.score(X, Y)) ```

      Fit a linear regression model that may be used to predict the charges value, just by using all other attributes of the dataset. Print the $ R^2 $ score of this model. You should see an improvement in the performance.

from sklearn.metrics import r2_score
xdt=df.drop('charges',axis=1)
ydt=df['charges']
lr.fit(xdt,ydt)
yhat=lr.predict(xdt)
# r2_score takes (y_true, y_pred); reversing the arguments gives a different value
scr=r2_score(ydt,yhat)
plotdists(ydt,yhat,'Original','Predicted',[f"R\N{superscript two}={round(scr,4)}"])
      
      Click here for Solution ```python # definition of Y and lm remain same as used in last cell. Z = df[["age", "gender", "bmi", "no_of_children", "smoker", "region"]] lm.fit(Z,Y) print(lm.score(Z, Y)) ```

      Create a training pipeline that uses StandardScaler(), PolynomialFeatures() and LinearRegression() to create a model that can predict the charges value using all the other attributes of the dataset. There should be even further improvement in the performance.

# The pipeline already applies PolynomialFeatures, so fit it on the raw features
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
pipe=Pipeline(Input)
pipe.fit(xdt,ydt)
ypipt=pipe.predict(xdt)
plotdists(ydt,ypipt,'Original','Predicted',[f"R\N{superscript two}={round(pipe.score(xdt,ydt),4)}"])
      
      Click here for Solution ```python # Y and Z use the same values as defined in previous cells Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())] pipe=Pipeline(Input) Z = Z.astype(float) pipe.fit(Z,Y) ypipe=pipe.predict(Z) print(r2_score(Y,ypipe)) ```

      Task 5 : Model Refinement

      Split the data into training and testing subsets, assuming that 20% of the data will be reserved for testing.

      xtr,xts,ytr,yts=train_test_split(xdt,ydt,test_size=0.2,random_state=0)
      
      Click here for Solution ```python # Z and Y hold same values as in previous cells x_train, x_test, y_train, y_test = train_test_split(Z, Y, test_size=0.2, random_state=1) ```

Initialize a Ridge regressor that uses the hyperparameter $ \alpha = 0.1 $. Fit the model using the training data subset. Print the $ R^2 $ score for the testing data.

from sklearn.metrics import r2_score
RidgeModel=Ridge(alpha=0.1)
RidgeModel.fit(xtr,ytr)
test_score= RidgeModel.score(xts,yts)
print('Ridge score= ', test_score)
yRidge= RidgeModel.predict(xts)
# r2_score(y_true, y_pred): with the correct argument order this equals score() above
print('Sklearn score= ',r2_score(yts,yRidge))
plotdists(yts,yRidge,'Original','Predicted',[f"R\N{superscript two}={round(r2_score(yts,yRidge),4)}"])

Ridge score=  0.7452378156489365
Sklearn score=  0.7452378156489365
      
      Click here for Solution ```python # x_train, x_test, y_train, y_test hold same values as in previous cells RidgeModel=Ridge(alpha=0.1) RidgeModel.fit(x_train, y_train) yhat = RidgeModel.predict(x_test) print(r2_score(y_test,yhat)) ```

      Apply polynomial transformation to the training parameters with degree=2. Use this transformed feature set to fit the same regression model, as above, using the training subset. Print the $ R^2 $ score for the testing subset.

pl2=PolynomialFeatures(degree=2)
data_tr_poly2=pl2.fit_transform(xtr)
# Only transform the test set; the feature mapping is defined by the training data
data_ts_poly2=pl2.transform(xts)
RidgeModel.fit(data_tr_poly2,ytr)
predicted=RidgeModel.predict(data_ts_poly2)
plotdists(yts,predicted,'Original','Predicted',[f"R\N{superscript two}={round(r2_score(yts,predicted),4)}"])
      

      Lesson Summary

      How to split your data using the train_test_split() method into training and test sets. You use the training set to train a model, discover possible predictive relationships, and then use the test set to test your model to evaluate its performance.
      
How to use the generalization error to measure how well your model does at predicting previously unseen data.
      
      How to use cross-validation by splitting the data into folds where you use some of the folds as a training set, which we use to train the model, and the remaining parts are used as a test set, which we use to test the model. You iterate through the folds until you use each partition for training and testing. At the end, you average results as the estimate of out-of-sample error.
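As a sketch of the one-call version (assuming the feature set xdt and target ydt from the project above):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# cv=4: each of the 4 folds serves once as the test partition;
# the mean R^2 across folds estimates the out-of-sample performance
scores = cross_val_score(LinearRegression(), xdt, ydt, cv=4)
print(scores.mean(), scores.std())
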
      
      How to pick the best polynomial order and problems that arise when selecting the wrong order polynomial by analyzing models that underfit and overfit your data.
      
      Select the best order of a polynomial to fit your data by minimizing the test error using a graph comparing the mean square error to the order of the fitted polynomials.
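A sketch of that procedure, assuming hypothetical 1-D predictor arrays x_train, x_test and targets y_train, y_test from a train/test split: fit one polynomial per order and keep the one with the smallest test MSE.

import numpy as np
from sklearn.metrics import mean_squared_error

# Test MSE for each candidate polynomial order; pick the minimum
orders = range(1, 6)
test_mse = []
for n in orders:
    p = np.poly1d(np.polyfit(x_train, y_train, n))
    test_mse.append(mean_squared_error(y_test, p(x_test)))
print('Best order:', list(orders)[int(np.argmin(test_mse))])
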
      
      You should use ridge regression when there is a strong relationship among the independent variables.  
      
      That ridge regression prevents overfitting.
      
      Ridge regression controls the magnitude of polynomial coefficients by introducing a hyperparameter, alpha. 
      
To determine alpha, you divide your data into training and validation sets. Starting with a small value of alpha, you train the model, make a prediction using the validation data, then calculate the R-squared and store the value. You repeat the process for progressively larger values of alpha and select the value of alpha that maximizes R-squared.
      
      That grid search allows you to scan through multiple hyperparameters using the Scikit-learn library, which iterates over these parameters using cross-validation. Based on the results of the grid search method, you select optimum hyperparameter values.
      
      The GridSearchCV() method takes in a dictionary as its argument where the key is the name of the hyperparameter, and the values are the hyperparameter values you wish to iterate over.
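A sketch of that call (again assuming xdt and ydt from the project above, and a candidate list of alpha values):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# The dictionary key must match the estimator's hyperparameter name ('alpha')
parameters = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), parameters, cv=4)
grid.fit(xdt, ydt)
print(grid.best_estimator_)
print(grid.best_score_)
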
try:
    # Note: a failing ! shell command does not raise a Python exception
    !jupyter nbconvert Data_Analysis_Notes.ipynb --to html --template pj
except Exception as e:
    print('HTML not stored')
      
import os
import shutil
      FromFld='C:\\Users\\Gamaliel\\Documents\\G\\ADD\\IBM_DS\\Data_Analysis_Py\\'
      Tofld='C:\\Users\\Gamaliel\\Documents\\G\\ADD\\IBM_DS\\IBM_DS_Jupyter_Tasks\\Python4DataScience\\'
      HTML_Notes='Data_Analysis_Notes.html'
      Jupyter_Notes='Data_Analysis_Notes.ipynb'
      try:
          if os.path.isfile(Tofld+'/'+HTML_Notes):
              os.remove(Tofld+'/'+HTML_Notes)
              print(HTML_Notes, 'deleted in', Tofld)
              shutil.move(os.path.join(FromFld,HTML_Notes),os.path.join(Tofld,HTML_Notes))
              print(HTML_Notes, 'replaced in', Tofld)
          else:
              shutil.move(os.path.join(FromFld,HTML_Notes),os.path.join(Tofld,HTML_Notes))
              print(HTML_Notes, 'written in', Tofld)
      except Exception as e:
          print('HTML not moved')
      
          # NB
      
      try:
          if os.path.isfile(Tofld+'/'+Jupyter_Notes):
              os.remove(Tofld+'/'+Jupyter_Notes)
              print(Jupyter_Notes, 'deleted in', Tofld)
              shutil.copy(os.path.join(FromFld,Jupyter_Notes),os.path.join(Tofld,Jupyter_Notes))
              print(Jupyter_Notes, 'copied in', Tofld)
          else:
              shutil.copy(os.path.join(FromFld,Jupyter_Notes),os.path.join(Tofld,Jupyter_Notes))
              print(Jupyter_Notes, 'copied in', Tofld)
      except Exception as e:
          print('NB not moved')
      
      Data_Analysis_Notes.html written in C:\Users\Gamaliel\Documents\G\ADD\IBM_DS\IBM_DS_Jupyter_Tasks\Python4DataScience\
      Data_Analysis_Notes.ipynb copied in C:\Users\Gamaliel\Documents\G\ADD\IBM_DS\IBM_DS_Jupyter_Tasks\Python4DataScience\